home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Tech Arsenal 1
/
Tech Arsenal (Arsenal Computer).ISO
/
tek-15
/
phbench.zip
/
ARTICLE
< prev
next >
Wrap
Text File
|
1993-01-04
|
15KB
|
399 lines
[The following article appeared in "C Users Journal" May 1988.
It describes the purpose and use of the enclosed benchmarks. ]
SIMPLE BENCHMARKS FOR C COMPILERS
by Thomas Plum
Dr.Plum is the author of several books on C, including Efficient C (co-
authored with Jim Brodie). He is Vice-Chair of the ANSI X3J11 Committee,
and Chairman of Plum Hall Inc, which offers introductory and advanced sem-
inars on C.
Copyright (c) 1988, Plum Hall Inc
We are placing into the public domain some simple benchmarks with several
appealing properties:
They are short enough to type while browsing at trade shows.
They are protected against overly-aggressive compiler optimizations.
They reflect empirically-observed operator frequencies in C programs.
They give a C programmer information directly relevant to programming.
In Efficient C, Jim Brodie and I described how useful it can be for a pro-
grammer to have a general idea of how many microseconds it takes to execute
the "average operator" on register int's, on auto short's, on auto
long's, and on double data, as well as the time for an integer multiply,
and the time to call-and-return from a function. These six numbers allow a
programmer to make very good first-order estimates of the CPU time that a
particular algorithm will take.
The following easily-typed benchmark programs determine these times
directly. The first one is benchreg.c ("benchmark for register opera-
tors"):
- 1 -
- 2 -
1 /* benchreg - benchmark for register integers
2 * Thomas Plum, Plum Hall Inc, 609-927-3770
3 * If machine traps overflow, use an unsigned type
4 * Let T be the execution time in milliseconds
5 * Then average time per operator = T/major usec
6 * (Because the inner loop has exactly 1000 operations)
7 */
8 #define STOR_CL register
9 #define TYPE int
10 #include <stdio.h>
11 main(ac, av)
12 int ac;
13 char *av[];
14 {
15 STOR_CL TYPE a, b, c;
16 long d, major, atol();
17 static TYPE m[10] = {0};
18
19 major = atol(av[1]);
20 printf("executing %ld iterations0, major);
21 a = b = (av[1][0] - '0');
22 for (d = 1; d <= major; ++d)
23 {
24 /* inner loop executes 1000 selected operations */
25 for (c = 1; c <= 40; ++c)
26 {
27 a = a + b + c;
28 b = a >> 1;
29 a = b % 10;
30 m[a] = a;
31 b = m[a] - b - c;
32 a = b == c;
33 b = a | c;
34 a = !b;
35 b = a + c;
36 a = b > c;
37 }
38 }
39 printf("a=%d0, a);
40 }
If you enter this and compile it to produce an executable program, you can
invoke it with one argument, the number of iterations for the major loop:
benchreg 10000
If this execution takes 16 seconds, this means that the average register
operation takes 1.6 microseconds (16,000 milliseconds divided by 10,000
iterations of the major loop).
Let us examine the program in detail. Lines 8 and 9 define STOR_CL
("storage class") and TYPE to be register and int . Thus, on line 15,
three variables ( a , b , and c ) are declared to be of this storage class
and type. At line 16, the major loop control variables are long integers,
but they are touched only one one-thousandth as often as the inner loop
- 3 -
variables, so they have little effect upon the timings. We are declaring
the atol function to return a long integer; it would otherwise default
to an int return. (If we were using a compiler based upon draft ANSI C,
we could #include <stdlib.h> to get the declaration of atol , but this
would limit the applicability of the benchmarks. This simple declaration is
all that even an ANSI compiler would need.)
At line 19, we set the major loop variable to the number given on the com-
mand line, and at line 20, we confirm it to the output.
Line 21 is crucial to preventing some overly aggressive optimizations. Ear-
lier versions of these benchmarks had simply initialized a and b to 1,
but this allows a compiler to forward-propagate a known constant value. The
expression av[1][0] gives the first digit-character of the command-line
argument; subtracting '0' produces a digit between 0 and 9. (Yes, the
latest ANSI draft now guarantees that the digit characters are a contiguous
sequence in any environment.)
Line 22 simply executes the major loop the number of times given by the
variable major . Line 25 repeats the inner loop 40 times, and with 25
operators in that loop, this produces 1000 operators. (Actually there are
1003, because of the initialization and the extra increment and test at loop
completion. The discrepancy is well within acceptable tolerances.)
Within the inner loop, 40% of the operators are assignments, in keeping with
the percentages reported in the original Drhystone work. Of the other
operators, the most frequent are plus and minus. The sequence of operations
is carefully chosen to ensure that a very aggressive optimizer cannot find
any useless code sections; each result depends functionally upon previous
results.
Finally, the printout at line 39 is also important to preventing over-
optimization. If the compiler could notice that we did nothing with the
computed result, it could discard all the operations that produced that
result.
We have completed our perusal of the first benchmark program, benchreg.c .
The second program ( benchsho.c , for short's) is derived from benchreg.c
by changing lines 8 and 9: STOR_CL becomes auto , and TYPE becomes
short . The program is otherwise unchanged.
The third program ( benchlng.c , for long's) is obtained by leaving
STOR_CL as auto and changing TYPE to long .
To make the fourth program ( benchmul.c , for multiplies) we set TYPE to
int , and change lines 27 through 36 to one source line which does 25 multi-
plies:
a = 3 *a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a; /* 25 * */
The fifth program ( benchfn.c , for functions) is a major rewrite. We
arrange a series of function definitions for f3 , f2 , f1 , and f0 such
that each call to function f0 generates exactly 1000 function-call opera-
tions. In case the compiler has an aggressive optimizer, move the function
f3 to a separate source file, so that the compiler cannot see how useless
- 4 -
it is. The global variable dummy will make the compiler think that f3
might be up to something useful. Here, then, is the benchfn.c function:
1 /* benchfn - benchmark for function calls
2 * Thomas Plum, Plum Hall Inc, 609-927-3770
3 * Let T be the execution time in milliseconds
4 * Then average time per operator = T/major usec
5 * (Because the inner loop has exactly 1000 operations)
6 */
7 #include <stdio.h>
8 int dummy = 0;
9
10 /* f3 - lowest level function
11 * Put this in separate source file if compiler detects and
12 * optimizes useless code
13 */
14 f3() { }
15
16 f2() { f3();f3();f3();f3();f3();f3();f3();f3();f3();f3();} /* 10 */
17 f1() { f2();f2();f2();f2();f2();f2();f2();f2();f2();f2();} /* 10 */
18 f0() { f1();f1();f1();f1();f1();f1();f1();f1();f1();} /* 9 */
19
20 main(ac, av)
21 int ac;
22 char *av[];
23 {
24 long d, major, atol();
25
26 major = atol(av[1]);
27 printf("executing %ld iterations0, major);
28 for (d = 1; d <= major; ++d)
29 f0(); /* executes 1000 calls */
30 printf("dummy=%d0, dummy);
31 }
The sixth program ( benchdblc. , for double's ) is derived from benchlng.c
by changing STOR_CL to auto , TYPE to double , and replacing the inner
loop body with this slightly different version:
a = a + b + c;
b = a * 2;
a = b / 10;
a = -a;
b = -a - b - c;
a = b == c;
b = a + c;
a = !b;
b = a + c;
a = b > c;
These changes are necessary because floating-point operands are not allowed
for the shift, remainder, and bitwise operators, and because the subscript
operator does not really exercise the floating-point instructions. This
revised inner loop still gives us a representative mix of typical opera-
tions.
- 5 -
This, then, completes our collection of six benchmark programs. After they
are compiled to produce executable programs, the next question is "How do I
time the execution?"
On UNIX systems, the timing is easy -- just run the time command:
$ time benchreg 10000
The sum of the "user" and "system" times will give the CPU time used by the
program.
More accurately, we could time the execution of zero iterations, and sub-
tract that time from the time for the measured number of iterations.
On MS-DOS systems, timings can be obtained, but with greater difficulty. If
we create a file named CR-LF which contains just one newline (or
"carriage-return-newline" in DOS parlance), we could time our program with a
"batch" file such as this:
time <cr-lf
benchreg 0
time <cr-lf
benchreg 10000
time <cr-lf
We must then take times that are expressed in minutes-and-seconds and pro-
duce differences expressed in seconds.
With whichever method, we eventually produce six numbers that are character-
istic of a particular environment (a specific compiler supporting a specific
machine).
[NOTE: Since this article appeared, I have added a driver program, benches.c.
In an ANSI environment with the clock function, it will run all the tests
and report the results, eliminating the need for manual computations.]
Here are some examples of timing results that have been obtained on a
variety of minicomputer and workstation environments:
- 6 -
Machine/compiler register auto auto int func auto
int short long multiply call dbl
AT&T 3B2/05 (-O) 1.36 3.87 2.62 15.4 7.7 22.5
AT&T 3B2/05 (no -O) 1.78 4.66 2.75 16.2 9.3 22.5
AT&T 3B2/400 (-O) 1.09 1.36 1.10 16.2 10.0(?) 91.4
AT&T 3B2/400 (no -O) 1.14 2.61 2.36 17.3 11.3 91.1
Apollo DN330 (-O) 1.36 .78 1.36 10.17 3.57
Apollo DN330 (no -O) 1.54 1.28 1.54 11.30 3.64
Apollo DN580 (-O) 1.03 .59 1.03 7.67 2.72
Apollo DN580 (no -O) 1.18 .97 1.18 8.48 2.77
Apollo DN660 (-O) 5.88 1.24 5.88 21.86 4.26
Apollo DN660 (no -O) 5.93 1.52 5.93 21.93 4.29
Cray X-MP (no vectors) .0567 .0656 .0822 .366 .821 .082
Masscomp 5500 3.18 2.7 4.9 30.8 7.3
Masscomp 5600 (-O) .45 .61 .46 2.83 1.04
Masscomp 5600 (no -O) .46 .78 .64 2.99 1.76
Pyramid 90X (-O) .85 1.04 .86 3.64 1.9 2.37
Pyramid 90X (no -O) .86 1.01 .86 3.65 1.8 2.34
Sequent (-O) 1.39 2.99 2.53 9.90 9.3
Sequent (no -O) 1.50 3.25 2.83 9.95 13.2
Sun 3/260HM (-O) .31 .48 .47 1.98 1.16
Sun 3/260HM (no -O) .36 .58 .57 1.99 1.62
Sun 3/75M (-O) .47 .77 .76 3.00 2.12
Sun 3/75M (no -O) .53 .95 .94 3.01 2.73
Sun 3/75M(4.2, -O) .50 .81 .83 2.85 1.5 20.7
Sun 3/75M(4.2, no -O) .54 1.00 1.01 2.97 2.7 21.1
Sun 3/75M(VM, -O) .46 .77 .75 2.96 2.1 20.8
Sun 3/75M(VM, no -O) .52 .96 .93 2.97 2.7 21.1
VAX 11/730 (-O) 4.00 9.80 6.20 16.2 42.8 12.4
VAX 11/730 (no -O) 4.73 10.2 7.45 16.57 51.5 17.0
VAX 11/780 (-O) 1.21 2.43 1.67 2.76 15.0 2.95
VAX 11/780 (BSD 4.2) 1.38 2.42 1.96 2.92 17.2
VAX 11/780 (UNIX 5.2) 1.24 2.48 1.79 2.72 15.7 3.89
VAX 11/780 (no -O) 1.29 2.51 1.85 2.70 16.7 3.89
VAX 11/785 (-O) .93 1.85 1.32 5.00 13.9 47.5
VAX 11/785 (no -O) 1.01 1.96 1.44 5.08 14.2 5.42
VAX 8650(UNIX -O) .236 .484 .298 .589 2.63 .578
VAX 8650(UNIX no -O) .258 .482 .316 .574 3.06 .791
VAX 8650(Ultrix -O) .23 .40 .29 .53 2.4 .56
VAX 8650(Ultrix no -O) .26 .41 .34 .56 2.8 .77
Notice that some of these timings were run before the benchdbl benchmark
had been written. There are no examples of the popular PC environments in
this table. If interested readers wish to run these benchmarks on their own
environments, I will endeavor to present these results in a future article.
Processor speeds are sometimes described in "MIPS" (millions of instructions
per second); using a value such as the number of register operators per
second in C might give rise to a "MOPS" measurement of more use to C pro-
grammers. Those of us who have tried these benchmarks have appreciated the
intuitive grasp that they give of the speed of current machines and com-
pilers. I hope that you too will find them of interest.